Basic EM workflow 1 (Restaurants data set)

Introduction

This IPython notebook explains a basic workflow to match two tables using py_entitymatching. Our goal is to match restaurants from Fodors and Zagat sites. The datasets contain information about the restaurants.

First, we need to import py_entitymatching package and other libraries as follows:



In [1]:

    
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')

import py_entitymatching as em
import pandas as pd
import os



In [2]:

    
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )









    



python version: 3.5.2 | packaged by conda-forge | (default, Sep  8 2016, 14:36:38) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]
pandas version: 0.19.2
magellan version: 0.1.0

Matching two tables typically consists of the following three steps:

1. Reading the input tables

2. Blocking the input tables to get a candidate set

3. Matching the tuple pairs in the candidate set

Read input tables

We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.



In [3]:

    
# Get the paths
path_A = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/fodors.csv'
path_B = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/zagats.csv'



In [4]:

    
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')









    



Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.



In [5]:

    
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))









    



Number of tuples in A: 533
Number of tuples in B: 331
Number of tuples in A X B (i.e the cartesian product): 176423



In [6]:

    
A.head()









    Out[6]:






  
    
      
      id
      name
      addr
      city
      phone
      type
    
  
  
    
      0
      534
      arnie mortons of chicago
      435 s. la cienega blv.
      los angeles
      310/246-1501
      american
    
    
      1
      535
      arts delicatessen
      12224 ventura blvd.
      studio city
      818/762-1221
      american
    
    
      2
      536
      hotel bel-air
      701 stone canyon rd.
      bel air
      310/472-1211
      californian
    
    
      3
      537
      cafe bizou
      14016 ventura blvd.
      sherman oaks
      818/788-3536
      french
    
    
      4
      538
      campanile
      624 s. la brea ave.
      los angeles
      213/938-1447
      american



In [7]:

    
B.head()









    Out[7]:






  
    
      
      id
      name
      addr
      city
      phone
      type
    
  
  
    
      0
      1
      apple pan  the
      10801 w. pico blvd.
      west la
      310-475-3585
      american
    
    
      1
      2
      asahi ramen
      2027 sawtelle blvd.
      west la
      310-479-2231
      noodle shops
    
    
      2
      3
      baja fresh
      3345 kimber dr.
      westlake village
      805-498-4049
      mexican
    
    
      3
      4
      belvedere  the
      9882 little santa monica blvd.
      beverly hills
      310-788-2306
      pacific new wave
    
    
      4
      5
      benitas frites
      1433 third st. promenade
      santa monica
      310-458-2889
      fast food



In [8]:

    
# Display the keys of the input tables
em.get_key(A), em.get_key(B)









    Out[8]:





('id', 'id')

Block tables to get candidate set

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching. py_entitymatching provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to input tables.

For the matching problem at hand, we know that two restaurants with no overlap between the names will not match. So we decide the apply blocking over names:



In [9]:

    
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'name', 'name', 
                    l_output_attrs=['name', 'addr', 'city', 'phone'], 
                    r_output_attrs=['name', 'addr', 'city', 'phone'],
                    overlap_size=1, show_progress=False)



In [10]:

    
C.head()









    Out[10]:






  
    
      
      _id
      ltable_id
      rtable_id
      ltable_name
      ltable_addr
      ltable_city
      ltable_phone
      rtable_name
      rtable_addr
      rtable_city
      rtable_phone
    
  
  
    
      0
      0
      1036
      1
      pacific pan pacific hotel
      500 post st.
      san francisco
      415/929-2087
      apple pan  the
      10801 w. pico blvd.
      west la
      310-475-3585
    
    
      1
      1
      1016
      7
      hyde street bistro
      1521 hyde st.
      san francisco
      415/441-7778
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478
    
    
      2
      2
      552
      7
      pinot bistro
      12969 ventura blvd.
      los angeles
      818/990-0500
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478
    
    
      3
      3
      652
      7
      bistro garden
      176 n. canon dr.
      los angeles
      310/550-3900
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478
    
    
      4
      4
      991
      7
      bistro roti
      155 steuart st.
      san francisco
      415/495-6500
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478

Match tuple pairs in candidate set

In this step, we would want to match the tuple pairs in the candidate set. Specifically, we use learning-based method for matching purposes. This typically involves the following four steps:

Sampling and labeling the candidate set
Train matcher using labeled data
Predict the matches in the candidate set using trained matcher

Sampling and labeling the candidate set

First, we randomly sample 450 tuple pairs for labeling purposes.



In [11]:

    
# Sample  candidate set
S = em.sample_table(C, 450)

Next, we label the sampled candidate set. Specify we would enter 1 for a match and 0 for a non-match.



In [12]:

    
# Label S
G = em.label_table(S, 'gold')

For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package.



In [13]:

    
path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'restaurants/lbl_restnt_wf1.csv'
G = em.read_csv_metadata(path_G, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)









    Out[13]:





450

Train matcher using labeled data

First, we need to create a set of features.py_entitymatching provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.



In [14]:

    
# Generate features automatically 
feature_table = em.get_features_for_matching(A, B)

Next, we convert the labeled data to feature vectors using the feature table



In [15]:

    
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_name', 'ltable_addr', 'ltable_city', 'ltable_phone',
                    'rtable_name', 'rtable_addr', 'rtable_city', 'rtable_phone']
# Convert the labeled data to feature vectors using the feature table
H = em.extract_feature_vecs(G, 
                            feature_table=feature_table, 
                            attrs_before = attrs_from_table,
                            attrs_after='gold',
                            show_progress=False)

Then, we train the learning-based matcher using the feature vectors. For the purposes of the guide, we will use Random Forest matcher that is included in the py_entitymatching package.



In [16]:

    
# Instantiate the RF Matcher
rf = em.RFMatcher()



In [17]:

    
# Get the attributes to be projected while training
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id', 'gold'])
attrs_to_be_excluded.extend(attrs_from_table)



In [18]:

    
# Train using feature vectors from the labeled data.
rf.fit(table=H, exclude_attrs=attrs_to_be_excluded, target_attr='gold')

Predict the matches in the candidate set using trained matcher

Now, we use the trained matcher to predict matches in the candidate set. To do that, first we need to convert the candidate set to feature vectors.



In [19]:

    
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_name', 'ltable_addr', 'ltable_city', 'ltable_phone',
                    'rtable_name', 'rtable_addr', 'rtable_city', 'rtable_phone']
# Convert the cancidate set to feature vectors using the feature table
L = em.extract_feature_vecs(C, feature_table=feature_table,
                             attrs_before= attrs_from_table,
                             show_progress=False)

Next, we predict the matches in the candidate set using the trained matcher and the feature vectors.



In [20]:

    
# Get the attributes to be excluded while predicting 
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_to_be_excluded.extend(attrs_from_table)



In [21]:

    
# Predict the matches
predictions = rf.predict(table=L, exclude_attrs=attrs_to_be_excluded,                          
              append=True, target_attr='predicted', inplace=False)



In [22]:

    
predictions.head()









    Out[22]:






  
    
      
      _id
      ltable_id
      rtable_id
      ltable_name
      ltable_addr
      ltable_city
      ltable_phone
      rtable_name
      rtable_addr
      rtable_city
      ...
      city_city_sw
      type_type_jac_qgm_3_qgm_3
      type_type_cos_dlm_dc0_dlm_dc0
      type_type_jac_dlm_dc0_dlm_dc0
      type_type_mel
      type_type_lev_dist
      type_type_lev_sim
      type_type_nmw
      type_type_sw
      predicted
    
  
  
    
      0
      0
      1036
      1
      pacific pan pacific hotel
      500 post st.
      san francisco
      415/929-2087
      apple pan  the
      10801 w. pico blvd.
      west la
      ...
      1.0
      0.000000
      0.0
      0.0
      0.513889
      6.0
      0.250000
      0.0
      1.0
      0
    
    
      1
      1
      1016
      7
      hyde street bistro
      1521 hyde st.
      san francisco
      415/441-7778
      bistro 45
      45 s. mentor ave.
      pasadena
      ...
      2.0
      0.222222
      0.0
      0.0
      0.634199
      7.0
      0.363636
      0.0
      3.0
      0
    
    
      2
      2
      552
      7
      pinot bistro
      12969 ventura blvd.
      los angeles
      818/990-0500
      bistro 45
      45 s. mentor ave.
      pasadena
      ...
      1.0
      0.000000
      0.0
      0.0
      0.424242
      9.0
      0.181818
      -3.0
      2.0
      0
    
    
      3
      3
      652
      7
      bistro garden
      176 n. canon dr.
      los angeles
      310/550-3900
      bistro 45
      45 s. mentor ave.
      pasadena
      ...
      1.0
      1.000000
      1.0
      1.0
      1.000000
      0.0
      1.000000
      11.0
      11.0
      0
    
    
      4
      4
      991
      7
      bistro roti
      155 steuart st.
      san francisco
      415/495-6500
      bistro 45
      45 s. mentor ave.
      pasadena
      ...
      2.0
      0.000000
      0.0
      0.0
      0.424242
      9.0
      0.181818
      -3.0
      2.0
      0
    
  

5 rows × 48 columns

Finally, project the attributes and the predictions from the predicted table.



In [23]:

    
# Get the attributes to be projected out
attrs_proj = []
attrs_proj.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_proj.extend(attrs_from_table)
attrs_proj.append('predicted')

# Project the attributes
predictions = predictions[attrs_proj]



In [24]:

    
predictions.head()









    Out[24]:






  
    
      
      _id
      ltable_id
      rtable_id
      ltable_name
      ltable_addr
      ltable_city
      ltable_phone
      rtable_name
      rtable_addr
      rtable_city
      rtable_phone
      predicted
    
  
  
    
      0
      0
      1036
      1
      pacific pan pacific hotel
      500 post st.
      san francisco
      415/929-2087
      apple pan  the
      10801 w. pico blvd.
      west la
      310-475-3585
      0
    
    
      1
      1
      1016
      7
      hyde street bistro
      1521 hyde st.
      san francisco
      415/441-7778
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478
      0
    
    
      2
      2
      552
      7
      pinot bistro
      12969 ventura blvd.
      los angeles
      818/990-0500
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478
      0
    
    
      3
      3
      652
      7
      bistro garden
      176 n. canon dr.
      los angeles
      310/550-3900
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478
      0
    
    
      4
      4
      991
      7
      bistro roti
      155 steuart st.
      san francisco
      415/495-6500
      bistro 45
      45 s. mentor ave.
      pasadena
      818-795-2478
      0

	id	name	addr	city	phone	type
0	534	arnie mortons of chicago	435 s. la cienega blv.	los angeles	310/246-1501	american
1	535	arts delicatessen	12224 ventura blvd.	studio city	818/762-1221	american
2	536	hotel bel-air	701 stone canyon rd.	bel air	310/472-1211	californian
3	537	cafe bizou	14016 ventura blvd.	sherman oaks	818/788-3536	french
4	538	campanile	624 s. la brea ave.	los angeles	213/938-1447	american

	id	name	addr	city	phone	type
0	1	apple pan the	10801 w. pico blvd.	west la	310-475-3585	american
1	2	asahi ramen	2027 sawtelle blvd.	west la	310-479-2231	noodle shops
2	3	baja fresh	3345 kimber dr.	westlake village	805-498-4049	mexican
3	4	belvedere the	9882 little santa monica blvd.	beverly hills	310-788-2306	pacific new wave
4	5	benitas frites	1433 third st. promenade	santa monica	310-458-2889	fast food

	_id	ltable_id	rtable_id	ltable_name	ltable_addr	ltable_city	ltable_phone	rtable_name	rtable_addr	rtable_city	rtable_phone
0	0	1036	1	pacific pan pacific hotel	500 post st.	san francisco	415/929-2087	apple pan the	10801 w. pico blvd.	west la	310-475-3585
1	1	1016	7	hyde street bistro	1521 hyde st.	san francisco	415/441-7778	bistro 45	45 s. mentor ave.	pasadena	818-795-2478
2	2	552	7	pinot bistro	12969 ventura blvd.	los angeles	818/990-0500	bistro 45	45 s. mentor ave.	pasadena	818-795-2478
3	3	652	7	bistro garden	176 n. canon dr.	los angeles	310/550-3900	bistro 45	45 s. mentor ave.	pasadena	818-795-2478
4	4	991	7	bistro roti	155 steuart st.	san francisco	415/495-6500	bistro 45	45 s. mentor ave.	pasadena	818-795-2478